Online Multi-Task Gradient Temporal-Difference Learning

نویسندگان

  • Vishnu Purushothaman Sreenivasan
  • Haitham Bou-Ammar
  • Eric Eaton
چکیده

Reinforcement learning (RL) is an essential tool in designing autonomous systems, yet RL agents often require extensive experience to achieve optimal behavior. This problem is compounded when an RL agent is required to learn policies for different tasks within the same environment or across multiple environments. In such situations, learning task models jointly rather than independently can significantly improve model performance (Wilson et al. 2007; Lazaric and Ghavamzadeh 2010). While this approach, known as multi-task learning (MTL), has achieved impressive performance, the performance comes at a high computational cost when learning new tasks or when updating previously learned models. Recent work (Ruvolo and Eaton 2013) in the supervised setting has shown that online MTL can achieve nearly identical performance to batch MTL with large computational speedups. Building upon this approach, which is known as the Efficient Lifelong Learning Algorithm (ELLA), we develop an online MTL formulation of model-based gradient temporal-difference (GTD) reinforcement learning (Sutton, Szepesvári, and Maei 2008). We call the proposed algorithm GTD-ELLA. Our approach enables an autonomous RL agent to accumulate knowledge over its lifetime and efficiently share this knowledge between tasks to accelerate learning. Rather than learning a policy for an RL task tabula rasa, as in standard GTD, our approach rapidly learns a high performance policy by building upon the agent’s previously learned knowledge. Our preliminary results on controlling different mountain car tasks demonstrates that GTD-ELLA significantly improves learning over standard GTD RL.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Multi-step Predictions Based on TD-DBP ELMAN Neural Network for Wave Compensating Platform

The gradient descent momentum and adaptive learning rate TD-DBP algorithm can improve the training speed and stability of Elman network effectively. BP algorithm is the typical supervised learning algorithm, so neural network cannot be trained on-line by it. For this reason, a new algorithm (TDDBP), which was composed of temporal difference (TD) method and dynamic BP algorithm (DBP), was propos...

متن کامل

Multi-Timescale, Gradient Descent, Temporal Difference Learning with Linear Options

Deliberating on large or continuous state spaces have been long standing challenges in reinforcement learning. Temporal Abstraction have somewhat made this possible, but efficiently planing using temporal abstraction still remains an issue. Moreover using spatial abstractions to learn policies for various situations at once while using temporal abstraction models is an open problem. We propose ...

متن کامل

Natural actor-critic algorithms

We present four new reinforcement learning algorithms based on actor–critic, natural-gradient and function-approximation ideas, and we provide their convergence proofs. Actor–critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochasti...

متن کامل

Incremental Natural Actor-Critic Algorithms

We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods...

متن کامل

Online Multi-Task Learning for Policy Gradient Methods

Policy gradient algorithms have shown considerable recent success in solving high-dimensional sequential decision making tasks, particularly in robotics. However, these methods often require extensive experience in a domain to achieve high performance. To make agents more sampleefficient, we developed a multi-task policy gradient method to learn decision making tasks consecutively, transferring...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014